
Support validation of collection params with record types#7025

Merged
bentsherman merged 2 commits into master from record-collection-params on Apr 28, 2026

Conversation

@bentsherman
Member

This PR extends the support for loading a collection-type param from a CSV/JSON/YAML file to also validate and convert the elements against a record type when one is specified.

Notable changes:

  • Add ADR for workflow params (with appendix on runtime type analysis)
  • Compile params block as a hidden class to preserve runtime type information
  • Validate and convert collection elements against record type when specified
  • Add custom @Nullable annotation to model nullable record fields at runtime
  • Extend Types helper class to render parameterized types
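For illustration, the combination might look like the following sketch (the `Sample` record and field names here are hypothetical, echoing the examples discussed later in this thread):

```groovy
record Sample {
    id: String
    fastq_1: Path
    fastq_2: Path?   // nullable field, modeled at runtime via the custom @Nullable annotation
}

params {
    // when a CSV/JSON/YAML file path is supplied for this param, each
    // element is validated and converted against the Sample record type
    samples: List<Sample>
}
```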

@bentsherman bentsherman requested a review from jorgee April 10, 2026 21:39
@bentsherman bentsherman requested a review from a team as a code owner April 10, 2026 21:39
@netlify

netlify Bot commented Apr 10, 2026

Deploy Preview for nextflow-docs-staging ready!

  • Latest commit: c3b1ed9
  • Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/69f108094361700008066088
  • Deploy Preview: https://deploy-preview-7025--nextflow-docs-staging.netlify.app

@pditommaso pditommaso self-requested a review April 14, 2026 09:44
@jorgee
Contributor

jorgee commented Apr 14, 2026

Suggestions from AI review:

1. Null-safety bug in TypeHelper.asType for nullable Path fields

In modules/nf-commons/src/main/nextflow/util/TypeHelper.groovy, if a record has a nullable Path field (e.g., fastq_2: Path?) and the input value is null, then value.toString() at line ~99 will throw a
NullPointerException. The validation loop in asRecordType skips nullable fields, but the conversion loop still calls asType(entry.value, ...) when entry.value is null. Needs a null guard at the top of asType.
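A minimal sketch of the suggested guard (assuming `asType` takes the raw value and the target type, per the discussion above; this is not the actual implementation):

```groovy
static Object asType(Object value, Type type) {
    // null guard: a nullable field (e.g. `fastq_2: Path?`) may legitimately
    // hold null, so return it as-is instead of calling value.toString()
    if( value == null )
        return null
    // ... existing conversion logic (Path, collection, record, etc.) ...
}
```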

2. No unit tests for TypeHelper conversion methods

getRawType, isCollectionType, asType, asCollectionType, and asRecordType have no dedicated unit tests — only tested indirectly through ParamsDslTest. Given these methods handle type reflection and conversion (prone
to subtle bugs), they should have their own tests covering edge cases: null values, nested parameterized types, unknown types, etc.

3. No test for @Nullable annotation propagation

There's no test verifying that a ?-suffixed record field actually gets the @Nullable annotation at runtime, or that asRecordType correctly allows missing values for nullable fields. Only the rejection path (non-nullable missing field) is partially covered.


- Replacing `nextflow_schema.json` -- while the `params` block addresses many of the same needs, existing pipelines that use a JSON Schema should not be required to migrate. A native integration with `nextflow_schema.json` can be explored in the future.

- Supporting nested params -- the `params` block only supports a flat list of params. Nested params can still be used in the config, but they do not have first-class support at this time.
Contributor


Record types are considered nested params, right? I mean, can I define a param as a record type, like the following?

params {
    meta: Metadata
}

record Metadata {
    id: String
    ....
}

Member Author


Yes, that would be the correct way to support a "nested" param, although I prefer not to call a record param a nested param because they have slightly different semantics.

Contributor


I have tried to run a workflow with a record type defined as a param as above, but I can't find a way to pass the value. I have tried the following:

nextflow run main.nf --meta.id='hola'

But I get the following error. It seems that a Map is not assignable to a record type.

Parameter meta with type Metadata cannot be assigned to [id:hola] [LinkedHashMap]

Am I doing it correctly? Is it out of the scope of this PR, or should it work?

If it is out of scope, it is strange that a list of records is supported but not a single one.

Member Author


Thinking more about it, I think it's out of scope. The main goal here is to support collections of records. We can look into supporting a basic "record" param, but I think there are still some unresolved details.

@bentsherman bentsherman requested a review from jorgee April 17, 2026 16:32
@bentsherman
Member Author

  • Fixed null-safety bug
  • Added unit tests for TypeHelper

        return c;
    if( type instanceof ParameterizedType pt )
        return (Class) pt.getRawType();
    throw new IllegalStateException();
Contributor


Shouldn't it include a message with the reason for the exception?

Member Author


As far as I can tell this should never happen, so I just throw an IllegalStateException to capture the stack trace, following the convention used by the rest of the codebase.

Comment thread on adr/20250825-workflow-params.md (outdated)
Contributor

@jorgee jorgee left a comment


Added a couple of comments:

  • All tests use a plain List. It works in this case but not with custom record types (e.g. a list of records), as in the examples in the ADR and documentation. I also miss a test with a custom type that includes another custom type or a List<Path> as an inner property.
  • Minor issue about an exception without a reason message

@pditommaso
Member

Some initial thoughts and a few concerns about the direction of this ADR/implementation:

  1. Embedding file parsing and type binding into the core runtime

CSV/JSON/YAML parsing and the type-binding rules (file-path columns, nullable handling, record construction) are being baked directly into the core. This is a substantial surface to commit to — CSV alone carries quoting, separator, encoding, and header-inference choices — and once it's in core, every edge case and format request becomes a Nextflow release decision.

It also sets a narrow precedent: users will reasonably expect the same typed-binding experience for other formats that are increasingly common in scientific and data workflows — Parquet, Iceberg, Arrow, Avro, and so on. Committing to CSV/JSON/YAML in core makes each of those future additions a core concern too. This feels like a responsibility that should sit behind a built-in function, factory, or a core plugin rather than in the runtime itself.

  2. Schemas evolve — hand-maintaining them in Groovy is cumbersome

Input schemas (samplesheets, JSON Schema definitions, vendor/LIMS formats) are maintained upstream and change over time. Requiring authors to hand-transcribe them into Groovy record declarations creates a dual-maintenance problem with no tooling to catch drift, and it pushes schema ownership onto people who don't necessarily write Groovy. As a developer I'd rather be able to import a JSON Schema — or even a CSV file — and have it bind automatically to a Nextflow record.

  3. Alignment with existing workflow input definitions

More broadly, I think it's worth exploring how the params model could fit into the existing sub-workflow input definition rather than standing alongside it as a parallel construct. There may be an opportunity here to enable workflow composition and chaining along the same lines we already do for sub-workflows — this deserves a closer look before the two mechanisms diverge further.

I think these points are worth a deeper discussion before the params block becomes a hard commitment. Happy to dig into any of them further.

@bentsherman
Member Author

One of the unit tests had List<Record> instead of List<Sample>, and so it wasn't capturing the error you encountered. That is now fixed.

I also added a more complex unit test that converts a record containing a list of records

@bentsherman bentsherman requested a review from jorgee April 21, 2026 19:28
@bentsherman
Member Author

> CSV/JSON/YAML parsing and the type-binding rules (file-path columns, nullable handling, record construction) are being baked directly into the core. This is a substantial surface to commit to — CSV alone carries quoting, separator, encoding, and header-inference choices — and once it's in core, every edge case and format request becomes a Nextflow release decision.

CSV is indeed a tricky one. For this reason, the documentation states that the CSV must have a header row and use a comma separator, so that there is no ambiguity. I would be open to only supporting JSON in core, but CSV samplesheets are so ubiquitous that I figure it's better to support them.

JSON/YAML don't have these problems though, I think the behavior for them is quite clear

> It also sets a narrow precedent: users will reasonably expect the same typed-binding experience for other formats that are increasingly common in scientific and data workflows — Parquet, Iceberg, Arrow, Avro, and so on. Committing to CSV/JSON/YAML in core makes each of those future additions a core concern too. This feels like a responsibility that should sit behind a built-in function, factory, or a core plugin rather than in the runtime itself.

I agree. I would like to make this functionality extensible, with only CSV/JSON/YAML supported in the core runtime (or even just JSON). We should be able to do that in 26.10. Right now I'm focused on the basic functionality

> Input schemas (samplesheets, JSON Schema definitions, vendor/LIMS formats) are maintained upstream and change over time. Requiring authors to hand-transcribe them into Groovy record declarations creates a dual-maintenance problem with no tooling to catch drift, and it pushes schema ownership onto people who don't necessarily write Groovy. As a developer I'd rather be able to import a JSON Schema — or even a CSV file — and have it bind automatically to a Nextflow record.

These record types will need to be defined in pipeline code either way, in order to model the data as it flows through the pipeline. So it makes more sense for these record types to be the source of truth and to generate JSON schemas from the record types.

External tooling like Seqera Platform can provide pre-launch validation, e.g. comparing a pipeline input schema with upstream dataset schema and reporting any mismatch in the launch form.

> More broadly, I think it's worth exploring how the params model could fit into the existing sub-workflow input definition rather than standing alongside it as a parallel construct. There may be an opportunity here to enable workflow composition and chaining along the same lines we already do for sub-workflows — this deserves a closer look before the two mechanisms diverge further.

It's a bit late since we already made the params block stable in 25.10. Even so, what you are saying mostly comes down to syntax -- whether to define an entry workflow as a trio of params / workflow / output or as a subworkflow with some implicit behavior. Either way, you still need to be able to translate between Nextflow data types and the external world.

@bentsherman
Member Author

To put it concisely -- loading a list of records from a CSV/JSON/YAML file in the params block is the mirror image of saving an output channel to a CSV/JSON/YAML index file in the output block.
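As a sketch of that symmetry (param and file names hypothetical; the output-block shape follows the existing workflow output definition):

```groovy
// input direction: the params block loads samples from a file
// (e.g. `--samples samplesheet.csv`) and validates each row as a Sample
params {
    samples: List<Sample>
}

// output direction: the output block saves a channel back to an index file
output {
    samples {
        index {
            path 'samples.json'
        }
    }
}
```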

I'm open to which file formats are supported in core, but I think both directions should be supported and, ultimately, both directions should be extensible so that people can integrate with arbitrary file formats via plugin.

@bentsherman bentsherman added this to the 26.04 milestone Apr 22, 2026
Contributor

@jorgee jorgee left a comment


Last changes look fine to me

Member

@pditommaso pditommaso left a comment


> I think both directions should be supported and, ultimately, both directions should be extensible so that people can integrate with arbitrary file formats via plugin.

I'm not sure this should go in the core; I'm not OK with merging this.

@bentsherman
Member Author

@pditommaso did you read my previous comment? I don't understand why you are blocking this

@pditommaso
Member

Let me clarify this further: I fully support static types for params and compile-time validation.

My concern is essentially about overloading and embedding data binding and type conversion into core functions/factories like fromList, as shown in:

// after (CSV, JSON, or YAML)
ch_samples = channel.fromList(params.samples)

This functionality is to some extent not directly related to this PR/ADR, and in my view it should be provided via one or more new factory methods or a core plugin dedicated to data loading/import/mapping.

@bentsherman bentsherman linked an issue Apr 28, 2026 that may be closed by this pull request
@bentsherman
Member Author

In that example, when you call channel.fromList(params.samples), params.samples has already been converted to List<Sample>. So fromList is only converting List<Sample> to Channel<Sample>, same as before.

The "magic" happens in the params block when the user supplies a file path instead of a literal list. The file parsing bit was originally implemented in #6675, and this PR adds the record type validation.
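To illustrate the distinction (a sketch, with `Sample` as in the earlier discussion):

```groovy
// by the time the workflow body runs, params.samples is already a
// validated List<Sample>, whether it came from a literal list or from
// a CSV/JSON/YAML file path supplied on the command line
ch_samples = channel.fromList(params.samples)   // Channel<Sample>, same as before
```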

I can remove the file parsing piece for now if you want more time to investigate it before we commit to it

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman requested a review from a team as a code owner April 28, 2026 19:18
@bentsherman
Member Author

Updated based on our discussion:

@bentsherman bentsherman merged commit 297172c into master Apr 28, 2026
25 checks passed
@bentsherman bentsherman deleted the record-collection-params branch April 28, 2026 20:25